The most dangerous “Borough”

First, I aggregated the number of crashes for each “Borough” and plotted the crash count for each one, see the plot below:
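This aggregation can be sketched as follows. The sketch assumes the crash data is loaded into a pandas DataFrame with a `BOROUGH` column; the toy rows below are made up (the real file has about 1.6 million rows):

```python
import pandas as pd

# Toy stand-in for the NYC crash dataset.
crashes = pd.DataFrame({
    "BOROUGH": ["BROOKLYN", "QUEENS", "BROOKLYN", "MANHATTAN", "BROOKLYN"],
})

# Count crashes per borough; value_counts sorts from most to fewest.
crash_counts = crashes["BOROUGH"].value_counts()

# The borough with the most crashes.
print(crash_counts.idxmax())  # BROOKLYN
```

On the real dataset, `crash_counts` would be the input to the bar plot of crashes per borough.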

(The source file contains 1,644,155 rows and 29 columns, read from a 0.34 GB file.)

As shown in the plot above, the most dangerous “Borough” is BROOKLYN.

Besides the crash count, we can also use other metrics as indicators of how dangerous a “Borough” is, for instance the number of people injured or killed, which also reflect how serious the accidents are.

However, the analysis procedure for such metrics is the same, only with a different metric, so I skip this analysis for now.

The worst place to have a Citi Bike station

In this section, I first calculated the density of bike crashes across the New York map, then visualized the densities as a heatmap. On the heatmap, the higher the crash density, the deeper the red color; safer locations are colored green.
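One simple way to compute such a density is to bin the crash coordinates into a grid, which is what the following sketch does. The coordinates here are synthetic; on the real data they would come from the latitude/longitude columns of the crash file:

```python
import numpy as np

# Synthetic crash coordinates roughly in the NYC area.
rng = np.random.default_rng(0)
lats = 40.70 + 0.05 * rng.random(1000)
lons = -74.00 + 0.05 * rng.random(1000)

# Bin the points into a 20x20 grid; each cell's count is the local crash density.
density, lat_edges, lon_edges = np.histogram2d(lats, lons, bins=20)

# The densest cell corresponds to the deepest red on the heatmap.
i, j = np.unravel_index(density.argmax(), density.shape)
print(density[i, j], lat_edges[i], lon_edges[j])
```

A kernel density estimate would give a smoother map, but a 2D histogram is the simplest density proxy and is what the grid-cell heatmap amounts to.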

(Note: this is my first time using map visualization, and I ran into some trouble accessing map data from Google. For now I use a static map; in the future I will make such maps interactive/zoomable.)

The crash density map is shown as follows:

From the heatmap I chose the most dangerous location and searched for the stations closest to it. By closest, I mean distance measured by the longitude and latitude gaps between each station and the crash location. This metric is only a rough proxy for physical distance, but it is acceptable because we focus on New York alone.
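The nearest-station search can be sketched like this. The station list and hotspot coordinates below are made up for illustration; the real values come from the Citi Bike station data and the densest heatmap cell:

```python
import math

# Hypothetical station list (name -> (lat, lon)); coordinates are illustrative.
stations = {
    "E 58 St & 1 Ave (NW Corner)": (40.7605, -73.9630),
    "W 42 St & 8 Ave":             (40.7577, -73.9897),
}
hotspot = (40.7610, -73.9640)  # most dangerous location from the heatmap

def degree_distance(a, b):
    """Euclidean distance in raw degrees -- a rough proxy for physical
    distance over an area as small as New York."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

closest = min(stations, key=lambda name: degree_distance(stations[name], hotspot))
print(closest)
```

For larger regions, a haversine (great-circle) distance would be more appropriate, since a degree of longitude shrinks with latitude.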

As shown in the plot above, the most dangerous station is E 58 St & 1 Ave (NW Corner).

Model prediction for bike accident counts using location only

The 3rd question focuses on predicting the crash count from location alone, so I use only latitude and longitude as the predictors.

(In the future, we can extend this analysis to incorporate timing predictors, to show how the crash count evolves with year, season, month, day of week, hour, etc.)

I built a prediction model using the accident density data as input. First, I split the dataset into two parts: 70% of the data is used for training the model, and 30% is used for testing whether the predictions are accurate. The reason for the train-test split is that if we use the same data for model building and testing, we run the risk of overfitting: the model matches the current dataset perfectly but fails to generate accurate predictions for new data.
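The 70/30 split can be sketched as follows. The features and counts here are synthetic, and the actual model is not specified in the text, so only the splitting step is shown:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
X = rng.random((n, 2))     # latitude/longitude features (synthetic)
y = rng.poisson(5.0, n)    # crash counts per location (synthetic)

# Shuffle the row indices, then take the first 70% for training
# and the remaining 30% for testing.
idx = rng.permutation(n)
split = int(0.7 * n)
train_idx, test_idx = idx[:split], idx[split:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(len(X_train), len(X_test))  # 70 30
```

In practice a helper such as scikit-learn's `train_test_split` does the same shuffling and slicing in one call.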

The first step after model building is to test whether the model generates accurate predictions on the test dataset. The prediction errors are measured in both absolute values and percentage values.

For example, if a location has an actual crash count of 100 and a predicted count of 120, then its absolute prediction error is abs(100 - 120) = 20, while its percentage error is 20/100 = 20%.
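The two error metrics from this example reduce to a few lines of code:

```python
def prediction_errors(actual, predicted):
    """Return (absolute error, percentage error) for one location.

    Percentage error is the absolute error divided by the actual count,
    so it is undefined when the actual count is zero.
    """
    abs_err = abs(actual - predicted)
    return abs_err, abs_err / actual

print(prediction_errors(100, 120))  # (20, 0.2)
```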

The following plots show the distributions of the absolute errors and percentage errors.

As shown in the plots above, the model generates accurate predictions: most errors are close to 0, and large deviations account for only a tiny portion.

After accepting this model, I used the predicted accident counts on the test dataset as the prediction results. In the plot below, I show the predicted crash counts across the New York map.

(Note: in the future, I can randomly sample over the whole map range of New York to generate more locations and predict accident counts for them. For example, supposing the New York map spans latitude 40 to 50 and longitude 70 to 80, I would randomly sample locations within this range as input for the prediction model.)
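This sampling idea can be sketched as follows. The bounding box below is an assumed approximation of New York City's actual extent (in practice it would come from the min/max coordinates in the crash data, and note that NYC longitudes are negative):

```python
import random

random.seed(0)

# Assumed bounding box for New York City (illustrative values).
LAT_MIN, LAT_MAX = 40.50, 40.92
LON_MIN, LON_MAX = -74.26, -73.70

def sample_locations(n):
    """Draw n uniform random (lat, lon) points inside the bounding box."""
    return [
        (random.uniform(LAT_MIN, LAT_MAX), random.uniform(LON_MIN, LON_MAX))
        for _ in range(n)
    ]

# These sampled points would be fed to the fitted model as prediction inputs.
grid = sample_locations(1000)
```

An evenly spaced grid (e.g. via `numpy.meshgrid`) would be an alternative to uniform random sampling and gives more regular map coverage.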